12/16/2019

Introduction

The Problem

People in major cities tend to live in clusters loosely defined by their root cultures.
Under this setting, one would expect restaurants to cluster as well, but do they?

  • Do restaurants really exhibit geospatial clustering by genre?

  • Are restaurant ratings also clustered geospatially?

Introduction

The method

  • The Yelp Fusion API was used to pull data from Yelp.com
    • This project utilized the yelpr package, with custom modified functions to efficiently request data from Yelp Fusion based on location and genre.
  • A Gaussian mixture model was used to identify the geospatial clusters exhibited by restaurants of the same genre
    • The implementation used the mclust package from CRAN.
  • The leaflet package was used to render all the maps in this project.

  • Special thanks to Samuel Luo for helping with the Shiny app implementation.
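The data-pull step listed above can be illustrated with a minimal sketch. This is not the project's R code (which used yelpr); it is a stdlib-only Python illustration, assuming the Yelp Fusion business search endpoint with its `location`, `categories`, `limit`, and `offset` query parameters and a page size of 50. The helper name `search_urls` is hypothetical.

```python
from urllib.parse import urlencode

# Yelp Fusion business search endpoint (requests also need an
# "Authorization: Bearer <API key>" header, omitted here).
SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def search_urls(location, category, total, page_size=50):
    """Build one search URL per page for a location and genre.

    `search_urls` is a hypothetical helper; page_size=50 is an
    assumption about the per-request limit.
    """
    urls = []
    for offset in range(0, total, page_size):
        params = {
            "location": location,
            "categories": category,
            # The last page may be smaller than page_size.
            "limit": min(page_size, total - offset),
            "offset": offset,
        }
        urls.append(f"{SEARCH_URL}?{urlencode(params)}")
    return urls


# Example: pages needed to request 120 Chinese restaurants in Boston.
pages = search_urls("Boston, MA", "chinese", 120)
```

Paging by offset is what lets a single location and genre yield several hundred restaurants, as in the scope described for this project.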

Data Wrangling

The Scope

  • Two Major Cities:
    • Boston
    • San Francisco
  • Two Biggest Genres:
    • Chinese
    • Italian
  • After Initial Wrangling (source code can be found in “./data/data_acquisition.R”)
    • 640 restaurants from the city of Boston
    • 1440 restaurants from the city of San Francisco

Data Wrangling

Original Data

After gathering data from Yelp.com, the original dataset contained 22 columns.

For this project, we mainly focused on:

  • Coordinates of the restaurant
  • Ratings of the restaurant

At first, reviews were also considered for text mining and sentiment analysis. However, the Yelp API returns only the top three reviews for a given restaurant, and each review is truncated to its first 160 characters, including spaces. After a few attempts, text analysis proved nearly pointless: at this level of detail, the text sentiment is heavily biased toward 5-star ratings.

EDA

Initial plots of the data

Visualizing the raw data on a map gives very little idea of how the restaurants are clustered:

Boston

San Francisco

EDA

Look at the distribution

To get a sense of the overall performance of both cities, we examine the distribution of their ratings as well as the distance to the city center.

A first glance shows that the distribution of ratings is very similar across the two cities.
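As an illustration of the distance measure, here is a minimal Python sketch of a great-circle (haversine) distance between a restaurant and a city center. The source does not say which distance formula the project used, so haversine is an assumption, and the city-center coordinates below are approximate values chosen for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Assumed, approximate city-center coordinates (not from the source).
BOSTON_CENTER = (42.3601, -71.0589)


def distance_to_center_km(lat, lon, center=BOSTON_CENTER):
    """Distance from a restaurant's coordinates to the chosen center."""
    return haversine_km(lat, lon, center[0], center[1])
```

Any consistent distance would serve the same purpose here, since the comparison is between rating groups within one city rather than between absolute values.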

EDA

Look at the distribution

If we look at the distribution of distance to the city center, grouped by rating, we find that the two cities differ in an interesting way.

Boston

San Francisco

In Boston, for both Italian and Chinese restaurants, food tends to be “Okay” as you approach the city center. Moving away from the city center, it becomes harder to tell whether the food is actually good.

In San Francisco, Chinese food seems to get better as you approach the city center, while the quality of Italian food stays roughly the same.

Clustering

Apply Gaussian Mixture

Gaussian mixture models are widely used for clustering analysis. They not only provide soft cluster assignments but can also detect a highly concentrated anomaly within a cluster.

Below are the results given by the mclust package.
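To make the soft-assignment idea concrete, here is a minimal Python sketch that computes, for one 2D point, the posterior probability of belonging to each component of an already fitted mixture. It is not the mclust implementation; the function names and the example weights, means, and covariances are made up for illustration.

```python
import math


def gaussian_pdf_2d(x, mean, cov):
    """Density of a bivariate normal with a full 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form using the closed-form inverse of a 2x2 matrix.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


def responsibilities(x, weights, means, covs):
    """Soft assignment: probability that x belongs to each component."""
    dens = [w * gaussian_pdf_2d(x, m, c)
            for w, m, c in zip(weights, means, covs)]
    total = sum(dens)  # would be 0 for points far from every component
    return [v / total for v in dens]


# Illustrative mixture: a broad component with a tight one nested inside.
weights = [0.8, 0.2]
means = [(0.0, 0.0), (0.0, 0.0)]
covs = [((1.0, 0.0), (0.0, 1.0)), ((0.01, 0.0), (0.0, 0.01))]

center_resp = responsibilities((0.0, 0.0), weights, means, covs)
edge_resp = responsibilities((1.5, 0.0), weights, means, covs)
```

In this made-up mixture, a point at the shared center is assigned mostly to the tight component, while a point farther out goes almost entirely to the broad one. That is the sense in which a mixture can pick out a dense pocket inside a larger cluster.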

Boston Chinese Restaurant Clustering

Boston Italian Restaurant Clustering

San Francisco Chinese Restaurant Clustering

San Francisco Italian Restaurant Clustering

Clustering

Results

As the plots show, there are highly concentrated areas within the clusters. The small blue cluster in the Boston Chinese restaurant map is actually the Chinatown area. Likewise, the Italian restaurant cluster map shows that the North End has a very high concentration of Italian restaurants.

If we plot these clusters on the map, it becomes clearer which part of the city each cluster belongs to.

Boston

Restaurants are clickable. Toggle the base layer to ‘None’ to observe the clusters. Hovering over an object shows more information.

San Francisco

Restaurants are clickable. Toggle the base layer to ‘None’ to observe the clusters. Hovering over an object shows more information.

Analysis

Rating by cluster

Based on the clustering, we can re-examine the distribution of ratings conditioned on the cluster.

Boston Chinese Restaurant

Boston Italian Restaurant

San Francisco Chinese Restaurant

San Francisco Italian Restaurant

Analysis

Distance to Cluster Center by Rating

One interesting observation is that in Boston, Chinatown is actually not a very good place to eat Chinese food. The furthest cluster, “Allston”, is on the other hand a much better place for Chinese food. For Italian food, however, the North End is actually a good choice.

We can also analyze the distribution of restaurants' distances to their cluster center, separated by rating. Because the clusters vary in radius, relative distances are calculated by dividing each distance by the standard deviation within its cluster.
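The normalization can be sketched as follows, in Python rather than the project's R. The source does not specify which standard deviation is meant, so this sketch assumes the sample standard deviation of the distances to the cluster center, computed separately for each cluster. The function name and example values are hypothetical.

```python
from collections import defaultdict
from statistics import stdev


def relative_distances(distances, labels):
    """Divide each distance by the std. dev. of distances in its cluster.

    Assumes every cluster has at least two restaurants and that the
    distances within a cluster are not all identical.
    """
    by_cluster = defaultdict(list)
    for d, k in zip(distances, labels):
        by_cluster[k].append(d)
    scale = {k: stdev(v) for k, v in by_cluster.items()}
    return [d / scale[k] for d, k in zip(distances, labels)]


# Illustrative values: cluster 1 is ten times wider than cluster 0.
dists = [1.0, 3.0, 10.0, 30.0]
clusters = [0, 0, 1, 1]
rel = relative_distances(dists, clusters)
```

After scaling, restaurants in a compact cluster and in a spread-out cluster become comparable: in the example, the two clusters yield the same relative distances even though their raw distances differ by a factor of ten.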

Boston Chinese Restaurant

Boston Italian Restaurant

San Francisco Chinese Restaurant

San Francisco Italian Restaurant

Conclusion

Results Analysis

As the map and plots show, restaurants are highly clustered in both cities, Boston and San Francisco. However, contrary to common belief, the best place for authentic cuisine is not always where the corresponding demographic group lives. (In Boston, it turns out that Allston has the best-rated Chinese food, while Chinatown has the lowest ratings.)

Conclusion

Speculation

One speculation is that the demand for authentic food might actually be driven by the concentration of the international population in the city. Take Boston as an example again. The highest concentration of Chinese population (excluding American-born Chinese) might actually be in Allston because of Boston University! The Chinese student population there might be the driving force behind the authenticity of food that resembles what they have at home, on the other side of the Earth.